This is the template for the third exercise. The purpose of this exercise is to familiarize yourself with the basics of unsupervised learning by using the agglomerative hierarchical clustering and k-means clustering algorithms to find patterns.
The data set utilised in this exercise is a simplified and downsampled version of a knowledge discovery and data mining competition data set. The data will be available on the course's Moodle page. For those who are interested, the original data can be found at https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. However, please make sure to use the version on Moodle as ex3_network_data.csv. The data is described below.
The data set contains samples of network activity simulated in a military network environment. There are different types of malicious activity, and also activity that is considered normal. It is not necessary to understand the details of the data set in order to complete the exercise.
In addition to normal activity, there are 3 types of malicious activity - denial of service, unauthorized remote login, and network probing (e.g. port scanning) - simulated in a military network environment. There are 500 samples of each class. There are 6 numeric features, described below:
src_bytes: number of bytes from source to destination
dst_bytes: number of bytes from destination to source
duration: length of connection (seconds)
count: number of connections to the same host as the current connection in the past two seconds
serror_rate: percentage of connections that have SYN errors
rerror_rate: percentage of connections that have REJ errors
In real applications, visualizing and cleaning the data are important steps. However, in this exercise you can treat the data as given, and focus on the unsupervised methods.
Please consider the following things when returning your notebook:
As in the two previous exercises, the grading scale is failed/passed/passed with honors.
For a passing grade, each part of the exercise, except for the BONUS, must be completed, and all questions should be answered. Some mistakes are allowed as long as you clearly attempt to solve all the exercises.
For doing both the exercise and the optional bonus task sufficiently well, you will be awarded one bonus point for the exam.
All the cells in the finished notebook should run without crashing. Please delete unnecessary cells. As a good rule of thumb, use "Restart and run all" on the finished notebook to make sure it runs without errors and produces the expected output.
Remember to comment your code to explain how it works and what you intend for it to do.
Answer the questions asked in the assignments in Markdown cells.
If you are having problems with this exercise, try an online search first, but don't just copy-paste any code you find. See exercise guidelines in the Moodle page of this course. If you can't find a solution to your problem, ask for advice in the course discussion forum on Moodle or contact oskari.s.heikkinen@utu.fi.
If/when you look things up during this exercise, please cite your sources (e.g. a link to a web page). It's better to cite too much than too little.
The below libraries are sufficient to complete the exercise. You can import additional functionality here if you want.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
from scipy.cluster.hierarchy import linkage, dendrogram
%matplotlib inline
# Path for the data
data_path = 'ex3_network_data.csv'
# Create a dataframe
network_data = pd.read_csv(data_path)
network_data.head()
| | src_bytes | dst_bytes | duration | count | serror_rate | rerror_rate | class |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 223 | 1.0 | 0.0 | denial_of_service |
| 1 | 1032 | 0 | 0 | 511 | 0.0 | 0.0 | denial_of_service |
| 2 | 520 | 0 | 0 | 511 | 0.0 | 0.0 | denial_of_service |
| 3 | 1032 | 0 | 0 | 510 | 0.0 | 0.0 | denial_of_service |
| 4 | 520 | 0 | 0 | 448 | 0.0 | 0.0 | denial_of_service |
network_data.dtypes
src_bytes        int64
dst_bytes        int64
duration         int64
count            int64
serror_rate    float64
rerror_rate    float64
class           object
dtype: object
network_data.describe()
| | src_bytes | dst_bytes | duration | count | serror_rate | rerror_rate |
|---|---|---|---|---|---|---|
| count | 2000.000000 | 2000.00000 | 2000.000000 | 2000.000000 | 2000.000000 | 2000.00000 |
| mean | 402.324500 | 382.56600 | 20.246500 | 111.158000 | 0.059415 | 0.14014 |
| std | 464.815917 | 861.81793 | 241.867267 | 198.399713 | 0.234554 | 0.34661 |
| min | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 0.000000 | 0.00000 |
| 25% | 8.000000 | 0.00000 | 0.000000 | 1.000000 | 0.000000 | 0.00000 |
| 50% | 286.500000 | 0.00000 | 0.000000 | 2.000000 | 0.000000 | 0.00000 |
| 75% | 748.000000 | 146.00000 | 0.000000 | 105.250000 | 0.000000 | 0.00000 |
| max | 4703.000000 | 4982.00000 | 4776.000000 | 511.000000 | 1.000000 | 1.00000 |
network_data['class'].value_counts()
denial_of_service            500
normal                       500
probe                        500
unauthorized_remote_login    500
Name: class, dtype: int64
Perform z-score standardization on the features to ensure that all features have the same scale.
Project the data to two dimensions by using principal component analysis (PCA) and visualize the resulting two-dimensional data in a scatter plot. Don't color the scatter plot yet.
Does it look like there are clear clusters? Don't worry if they're hard to see.
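Before standardizing the exercise data, it can help to verify what z-score standardization actually computes. The snippet below is a small sanity check on made-up numbers (not the exercise data), assuming scikit-learn's StandardScaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column of made-up values, just to check the formula
x = np.array([[1.0], [2.0], [3.0]])

# StandardScaler computes z = (x - mean) / std, using the
# population standard deviation (ddof=0)
z = StandardScaler().fit_transform(x)
manual = (x - x.mean()) / x.std()

print(np.allclose(z, manual))  # the two results should agree
```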
# Standardization
# I copied this from my own exercise 2 answers
#initialize
network_numeric_features = network_data.select_dtypes(include='number')
numeric_features = list(network_numeric_features.columns)
#scale
scaler = StandardScaler().fit(network_numeric_features)
network_numeric_features_scaled = scaler.transform(network_numeric_features)
# make dataframe again
network_numeric_features_scaled = pd.DataFrame(network_numeric_features_scaled)
# insert column names back
network_numeric_features_scaled.columns = numeric_features
# change feature_train numeric values to standardized
network_data_copy = network_data.copy()
network_data_copy[numeric_features] = network_numeric_features_scaled
network_data_scaled = network_data_copy
# display
display(network_data_scaled.head()) #all data
display(network_numeric_features_scaled.head()) #just numeric values
| | src_bytes | dst_bytes | duration | count | serror_rate | rerror_rate | class |
|---|---|---|---|---|---|---|---|
| 0 | -0.865773 | -0.444017 | -0.08373 | 0.563862 | 4.011105 | -0.404417 | denial_of_service |
| 1 | 1.355016 | -0.444017 | -0.08373 | 2.015840 | -0.253374 | -0.404417 | denial_of_service |
| 2 | 0.253229 | -0.444017 | -0.08373 | 2.015840 | -0.253374 | -0.404417 | denial_of_service |
| 3 | 1.355016 | -0.444017 | -0.08373 | 2.010798 | -0.253374 | -0.404417 | denial_of_service |
| 4 | 0.253229 | -0.444017 | -0.08373 | 1.698219 | -0.253374 | -0.404417 | denial_of_service |
| | src_bytes | dst_bytes | duration | count | serror_rate | rerror_rate |
|---|---|---|---|---|---|---|
| 0 | -0.865773 | -0.444017 | -0.08373 | 0.563862 | 4.011105 | -0.404417 |
| 1 | 1.355016 | -0.444017 | -0.08373 | 2.015840 | -0.253374 | -0.404417 |
| 2 | 0.253229 | -0.444017 | -0.08373 | 2.015840 | -0.253374 | -0.404417 |
| 3 | 1.355016 | -0.444017 | -0.08373 | 2.010798 | -0.253374 | -0.404417 |
| 4 | 0.253229 | -0.444017 | -0.08373 | 1.698219 | -0.253374 | -0.404417 |
# This is done based on the TKO_7093 Statistical Data Analysis course's PCA example code
# quantitative features (standardized)
network_numeric_features_scaled
# create PCA model
pca = PCA().fit(network_numeric_features_scaled)
# transform data into new space
network_numeric_pca = pd.DataFrame(pca.transform(network_numeric_features_scaled))
# visualise the first two PCA components
network_numeric_pca.plot.scatter(0, 1, figsize=(12,10))
# add transformed data back to original data frame
network_data_scaled_extended = pd.concat([network_data_scaled, network_numeric_pca], axis=1)
I can see three clear clusters: two thin clusters in the bottom left corner and one bigger cluster spanning to the top. These stand out here, but their relationship is not clear from this projection.
Because clustering is an unsupervised learning method, the class column is completely unnecessary for most of these tasks. You will only need the class column in Part 4, where it's used to compute a performance metric and to visually compare clustering results to the classes.
Cluster the data into 4 clusters using agglomerative hierarchical clustering. Try different values for the "linkage" parameter.
Use the z-score standardized 6-dimensional data for clustering - not the principal components!
What is the significance of the linkage criterion in a hierarchical clustering algorithm?
Evaluate the clustering performance for each linkage criterion using a metric called "silhouette score".
What does silhouette score quantify and how is it computed?
#https://www.youtube.com/watch?v=RjyR1IRm9VA
link = ['ward', 'complete', 'average', 'single'] #linkage types
for l in link:
    # initialize model
    clustering = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage=l)
    # fit
    clustering.fit(network_numeric_features_scaled)
    # silhouette score
    membership = clustering.labels_
    print(l, ": ", silhouette_score(network_numeric_features_scaled, membership))
ward :  0.6299185426636938
complete :  0.48157598037192545
average :  0.6241907400676748
single :  0.7288757664397134
What is the significance of the linkage criterion in a hierarchical clustering algorithm?
https://statsandr.com/blog/files/Hierarchical-clustering-cheatsheet.pdf
The linkage criterion defines how the distance (dissimilarity) between two clusters is measured, which in turn determines which clusters are merged at each step of the algorithm.
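To make the criteria concrete, here is a small sketch (toy points, not the exercise data) showing how single, complete, and average linkage measure the distance between two clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small toy clusters (made-up points, just for illustration)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between the clusters
pairwise = cdist(cluster_a, cluster_b)

single_link = pairwise.min()    # closest pair  -> 3.0
complete_link = pairwise.max()  # farthest pair -> 6.0
average_link = pairwise.mean()  # mean of all pairwise distances -> 4.5

print(single_link, complete_link, average_link)
```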
What does silhouette score quantify and how is it computed?
https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c#:~:text=Silhouette%20Coefficient%20or%20silhouette%20score%20is%20a%20metric%20used%20to,each%20other%20and%20clearly%20distinguished.
Silhouette score quantifies how well-separated the clusters are. For each sample, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the mean distance to the points in the nearest other cluster; the silhouette score is the mean of s over all samples.
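As a concrete illustration (with made-up one-dimensional points), the per-sample coefficient can be computed by hand and compared to scikit-learn's mean score:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four 1-D points in two obvious clusters (toy data)
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the point at 0.0:
#   a = mean distance to its own cluster           = |0 - 1| = 1.0
#   b = mean distance to the nearest other cluster = (10 + 11) / 2 = 10.5
a, b = 1.0, 10.5
s_first = (b - a) / max(a, b)  # per-sample silhouette coefficient

# silhouette_score is the mean of this coefficient over all samples
print(s_first, silhouette_score(X, labels))
```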
Cluster the data using average and complete linkage, and plot the dendrograms using scipy.cluster.hierarchy.dendrogram.
x = 1
link = ['complete', 'average'] # linkage types
fig, axs = plt.subplots(1, 2, figsize=(16, 7))
for l in link:
    linkage_data = linkage(network_numeric_features_scaled, l)
    plt.subplot(1, 2, x)
    dendrogram(linkage_data, truncate_mode='level', p=3)
    x = x + 1
    plt.title(l)
plt.show()
How do you interpret the dendrograms? How do they differ?
https://wheatoncollege.edu/wp-content/uploads/2012/08/How-to-Read-a-Dendrogram-Web-Ready.pdf
The lengths of the vertical lines tell us how dissimilar the merged clusters are from each other. Complete linkage seems to be doing a better job at the clustering, since the merge distances are larger, at least when observing the third level. Average linkage seems to produce more distinct clusters at the lower levels visible in the dendrograms.
# This is done based on the TKO_7093 Statistical Data Analysis course's clustering example code
# create k-means model with four clusters
kmeans = KMeans(n_clusters=4).fit(network_numeric_features_scaled)
kmeans_labels = kmeans.labels_
print("kmeans silhouette score: ", silhouette_score(network_numeric_features_scaled, kmeans_labels))
kmeans silhouette score: 0.634040977469723
for x in range(2, 50):
    kmeans = KMeans(n_clusters=x).fit(network_numeric_features_scaled)
    kmeans_labels = kmeans.labels_
    print(x, ": ", silhouette_score(network_numeric_features_scaled, kmeans_labels))
2 :  0.4247796965403217
3 :  0.5419542961632046
4 :  0.634040977469723
5 :  0.6581598820518922
6 :  0.7547927328720325
7 :  0.7727553051521152
8 :  0.7870033283213129
9 :  0.7686670126482058
10 :  0.748197467722064
11 :  0.7688513353507218
12 :  0.7563718309946389
13 :  0.7669658431130512
14 :  0.7860679698072327
15 :  0.7782966744801677
16 :  0.7861532672922883
17 :  0.8027367280714405
18 :  0.8011071216944013
19 :  0.802489199690086
20 :  0.8012209292782488
21 :  0.8004592398541462
22 :  0.8000588459512196
23 :  0.8018664437698471
24 :  0.8012506351117956
25 :  0.8034603698393512
26 :  0.8031815584520139
27 :  0.8089981677081105
28 :  0.810056665229098
29 :  0.805090013073449
30 :  0.8111623579893673
31 :  0.8099840544161045
32 :  0.8052821894291811
33 :  0.8020833340665822
34 :  0.8108935605496944
35 :  0.8185883852263796
36 :  0.793636784997507
37 :  0.8015330495521672
38 :  0.8001217550377581
39 :  0.8112948457888817
40 :  0.8105451076725799
41 :  0.8078645058836601
42 :  0.8114920403740162
43 :  0.8105834189033932
44 :  0.8125227591188027
45 :  0.8084332871923232
46 :  0.8083716682446531
47 :  0.806828492185299
48 :  0.8052740944777926
49 :  0.8114277635088372
Experiment with some other numbers of clusters. Does the data fit better into a different number of clusters according to silhouette score?
Adding more clusters doesn't seem to improve the score in the long run; from around 8 clusters onward the values plateau roughly in the range 0.75-0.82.
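A generic way to automate this kind of search is to take the k with the highest silhouette score. The sketch below does this on synthetic blobs rather than the exercise data; the centers, the k range, and the use of make_blobs are choices made here purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

ks = range(2, 7)
scores = [silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                     random_state=0).fit_predict(X))
          for k in ks]

best_k = ks[np.argmax(scores)]  # k with the highest silhouette score
print(best_k)
```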
Rand score is a measure of similarity between two partitions of a set of elements - in this case true classes and clusters found by the clustering algorithm - and it is one of the most frequently used performance metrics for clustering. It is computed by considering each pair of elements in the dataset and counting pairs of elements as follows:
a: number of pairs such that the elements are in the same class and in the same cluster
b: number of pairs such that the elements are in different classes and in different clusters
c: number of pairs such that the elements are in the same class but in different clusters
d: number of pairs such that the elements are in different classes but in the same cluster
Given a, b, c, d, the formula for rand index is:
rand_index = (a+b)/(a+b+c+d).
"Adjusted Rand index" is corrected for chance by using maximum and expected values of Rand index.
adj_rand_index = (rand_index - expected_rand_index) / (max_rand_index - expected_rand_index)
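As a sanity check on these formulas, the four pair counts can be computed by brute force on a tiny made-up labelling and compared against scikit-learn:

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

# Made-up true classes and cluster labels for four samples
classes  = [0, 0, 1, 1]
clusters = [0, 0, 1, 0]

a = b = c = d = 0
for i, j in combinations(range(len(classes)), 2):
    same_class = classes[i] == classes[j]
    same_cluster = clusters[i] == clusters[j]
    if same_class and same_cluster:
        a += 1  # same class, same cluster
    elif not same_class and not same_cluster:
        b += 1  # different class, different cluster
    elif same_class:
        c += 1  # same class, different cluster
    else:
        d += 1  # different class, same cluster

rand_index = (a + b) / (a + b + c + d)  # plain Rand index
print(rand_index, adjusted_rand_score(classes, clusters))
```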
clustering = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='single')
clustering.fit_predict(network_numeric_features_scaled)
labels = pd.DataFrame(clustering.labels_)
labels.columns = ['AggLabels']
# This is done based on the TKO_7093 Statistical Data Analysis course's PCA example code
# quantitative features (standardized)
network_numeric_features_scaled
# create PCA model
pca = PCA().fit(network_numeric_features_scaled)
# transform data into new space
network_numeric_pca = pd.DataFrame(pca.transform(network_numeric_features_scaled))
# add transformed data back to original data frame
network_data_scaled_extended = pd.concat([network_data_scaled, network_numeric_pca], axis=1)
network_data_scaled_extended['class'] = pd.Categorical(network_data_scaled_extended['class'])
# agglomerative clustering
network_data_scaled_ext = pd.concat([labels, network_numeric_pca], axis=1)
network_data_scaled_ext['AggLabels'] = pd.Categorical(network_data_scaled_ext['AggLabels'])
network_data_scaled_ext.head()
| | AggLabels | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| 0 | 0 | -0.784767 | -2.372303 | 3.040695 | -0.741587 | 1.218450 | -0.010308 |
| 1 | 0 | 2.056440 | -1.222154 | -0.698612 | 0.176703 | 0.232087 | -0.189304 |
| 2 | 0 | 1.306228 | -1.295604 | -0.535884 | 0.155114 | -0.036722 | -0.928571 |
| 3 | 0 | 2.053927 | -1.219385 | -0.697587 | 0.176456 | 0.230843 | -0.186342 |
| 4 | 0 | 1.147854 | -1.121159 | -0.471296 | 0.139573 | -0.115129 | -0.742002 |
import plotly.express as px
# plot PCA with actual class colors
fig = px.scatter(network_data_scaled_extended, x=0, y=1, color=network_data_scaled_extended['class'])
fig.show()
# plot pca with agglomerative clustering labels
fig = px.scatter(network_data_scaled_ext, x=0, y=1, color=network_data_scaled_ext['AggLabels'])
fig.show()
# adjusted Rand score
adjusted_rand_score(network_data_scaled_extended['class'], network_data_scaled_ext['AggLabels'])
0.00011807828578910034
Based on the visual representation and the adjusted Rand score, the labeling is bad. Here we see two completely different clusterings.
# This is done based on the TKO_7093 Statistical Data Analysis course's clustering example code
# create k-means model with four clusters
kmeans = KMeans(n_clusters=4).fit(network_numeric_features_scaled)
# get memberships of data points in clusters
predictions = kmeans.predict(network_numeric_features_scaled)
# add memberships back to original data frame
network_data_scaled_copy = network_data_scaled_extended.copy()
network_data_scaled_copy['predictions'] = pd.Categorical(predictions)
network_data_scaled_copy.head()
| | src_bytes | dst_bytes | duration | count | serror_rate | rerror_rate | class | 0 | 1 | 2 | 3 | 4 | 5 | predictions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.865773 | -0.444017 | -0.08373 | 0.563862 | 4.011105 | -0.404417 | denial_of_service | -0.784767 | -2.372303 | 3.040695 | -0.741587 | 1.218450 | -0.010308 | 3 |
| 1 | 1.355016 | -0.444017 | -0.08373 | 2.015840 | -0.253374 | -0.404417 | denial_of_service | 2.056440 | -1.222154 | -0.698612 | 0.176703 | 0.232087 | -0.189304 | 0 |
| 2 | 0.253229 | -0.444017 | -0.08373 | 2.015840 | -0.253374 | -0.404417 | denial_of_service | 1.306228 | -1.295604 | -0.535884 | 0.155114 | -0.036722 | -0.928571 | 0 |
| 3 | 1.355016 | -0.444017 | -0.08373 | 2.010798 | -0.253374 | -0.404417 | denial_of_service | 2.053927 | -1.219385 | -0.697587 | 0.176456 | 0.230843 | -0.186342 | 0 |
| 4 | 0.253229 | -0.444017 | -0.08373 | 1.698219 | -0.253374 | -0.404417 | denial_of_service | 1.147854 | -1.121159 | -0.471296 | 0.139573 | -0.115129 | -0.742002 | 0 |
# plot PCA with actual class colors
fig = px.scatter(network_data_scaled_extended, x=0, y=1, color=network_data_scaled_extended['class'])
fig.show()
# plot PCA with k-means cluster labels
fig = px.scatter(network_data_scaled_copy, x=0, y=1, color=network_data_scaled_copy['predictions'])
fig.show()
# adjusted Rand score
adjusted_rand_score(network_data_scaled_extended['class'], network_data_scaled_copy['predictions'])
0.2951546107072655
Which performs better according to the adjusted Rand score?
K-means was much better at clustering compared to agglomerative clustering. This can be seen from the visuals as well as the adjusted Rand score, although the Rand score is still not particularly good.
In this task, you are working with data where the classes are not available, given as ex3_seeds_data_BONUS.csv on Moodle. The original data set is available for those who are interested, but use the slightly modified data on Moodle instead.
In general this is a very challenging and open-ended type of task that requires in-depth domain knowledge for meaningful results. Note, however, that in this exercise you are not required to research the domain in question (e.g. properties of different varieties of wheat). You might need to search for more information related to clustering in order to complete this exercise.
Some of the questions are open-ended and have no correct answer. It's enough to clearly show that you thought about the questions.